In [ ]:

    
from IPython.display import Image

Introduction to NumPy and Pandas

Introduction to NumPy

most fundamental third-party package for scientific computing in Python
multidimensional array data structures
associated functions and methods to manipulate them.
Other third-party packages, including pandas, use NumPy arrays as backends for more specialized data structures

Comparison to Python

While Python comes with several container types (list,tuple,dict),
NumPy's arrays are implemented closer to the hardware, and are therefore more efficient than the built-in types.
This is particularly true for large data, for which NumPy scales much better than Python's built-in data structures.
NumPy arrays also retain a suite of associated functions and methods that allow for efficient array-oriented computing.

Import Convention

By convention numpy is imported



In [ ]:

    
import numpy as np

NumPy Arrays and Indexing

You can index an array in the same way you can index Python lists using slice notation



In [ ]:

    
lst = list(range(1000))
arr = np.arange(1000)

Here's what the array looks like



In [ ]:

    
arr[:10]



In [ ]:

    
arr[10:20]



In [ ]:

    
arr[10:20:2]



In [ ]:

    
type(arr)



In [ ]:

    
%timeit [i ** 2 for i in lst]



In [ ]:

    
%timeit arr ** 2

We can index arrays in the same ways as lists



In [ ]:

    
arr[5:10]



In [ ]:

    
arr[-1]

Arrays vs Lists

arrays are homogeneously typed
- all elements of an array must be of the same type.
- we see why when we think about the memory layout
lists can contain elements of arbitrary type



In [ ]:

    
['a', 2, (1, 3)]



In [ ]:

    
lst[0] = 'some other type'



In [ ]:

    
lst[:3]

We can't do this with an array



In [ ]:

    
arr[0] = 'some other type'

The data type is contained in the dtype attribute



In [ ]:

    
arr.dtype

The dtype is fixed
Other types will be cast to this type



In [ ]:

    
arr[0] = 1.234



In [ ]:

    
arr[:10]

What is an Array

Sometimes it's useful to peak under the hood to fix ideas
A block of memory with some extra information on how to intepret its contents



In [ ]:

    
Image("https://docs.scipy.org/doc/numpy/_images/threefundamental.png")

Array Creation



In [ ]:

    
np.zeros(5, dtype=float)



In [ ]:

    
np.zeros(5, dtype=int)



In [ ]:

    
np.zeros(5, dtype=complex)



In [ ]:

    
np.ones(5, dtype=float)

We have seen how the arange function generates an array for a range of integers.
linspace and logspace functions to create linearly and logarithmically-spaced grids respectively, with a fixed number of points and including both ends of the specified interval:



In [ ]:

    
np.linspace(0, 1, num=5)



In [ ]:

    
np.logspace(1, 4, num=4)

Random Number Generation

Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The np.random module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1) $ X \sim N(0, 1) $:

$$f(x \mid \mu=0, \sigma=1) = \sqrt{\frac{1}{2\pi \sigma^2}} \exp\left\{ -\frac{x^2}{2\sigma^2} \right\}$$



In [ ]:

    
np.random.randn(5)

$X \sim N(9, 3)$



In [ ]:

    
norm10 = np.random.normal(loc=9, scale=3, size=10)

Exercise: Random numbers

Generate a NumPy array of 1000 random numbers sampled from a Poisson distribution, with parameter lam=5. What is the modal value in the sample?



In [ ]:

    
%load solutions/random_number.py

Index Arrays

Above we showed how to index with numbers and slices
NumPy indexing is much more powerful than Python indexing
You can index with other arrays
- Boolean arrays
- Integer arrays

Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is True or False:

Boolean Indexing



In [ ]:

    
mask = norm10 > 9
mask



In [ ]:

    
norm10[mask]

Integer Indexing

Likewise you can index with integer arrays



In [ ]:

    
norm10[[1, 4, 6]]

Asssignment

This form of indexing is known as fancy-indexing
You can use fancy-indexing for assignment
- This is particularly useful for assignment given some condition



In [ ]:

    
norm10[norm10 > 9] = 0



In [ ]:

    
norm10



In [ ]:

    
norm10[[1, 4, 7]] = 10



In [ ]:

    
norm10

Copies vs Views

This is a common gotcha for people new to NumPy
While lvalue fancy-indexing in the case of assignment does not copy
- Just __setitem__
rvalue fancy-indexing produces a copy not a view
- __getitem__ followed by __setitem__

When we use slice notation to look at part of an array, it produces a view
That is, it points to the same memory of the original array



In [ ]:

    
x = np.arange(10)



In [ ]:

    
x



In [ ]:

    
y = x[::2]
y



In [ ]:

    
y[3] = 100
y



In [ ]:

    
x

This, however, produces a copy
Operating on the copy will not affect the original array



In [ ]:

    
a = norm10[[0, 1, 5]]



In [ ]:

    
a



In [ ]:

    
a[:] = -10



In [ ]:

    
a



In [ ]:

    
norm10

Exercise

Create an array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] without typing the values by hand. Assign 100 to elements 2 to 5 (zero-index). Print the array.

Create the same array as in step one above. Create an array from a slice of elements 2 to 5. Assign 100 to the slice. Hint try [:] to address all of the elements of an array. Print the original array and the slice.



In [ ]:

    
# [Solution here]



In [ ]:

    
%load solutions/copies_vs_views.py

Multidimensional Arrays

NumPy can create arrays of aribtrary dimensions, and all the methods illustrated in the previous section work with more than one dimension.
For example, a list of lists can be used to initialize a two dimensional array:



In [ ]:

    
samples_list = [[632, 1638, 569, 115], [433,1130,754,555]]
samples_array = np.array(samples_list)
samples_array.shape



In [ ]:

    
print(samples_array)

With two-dimensional arrays we start seeing the convenience of NumPy data structures: while a nested list can be indexed across dimensions using consecutive [ ] operators, multidimensional arrays support a more natural indexing syntax with a single set of brackets and a set of comma-separated indices:



In [ ]:

    
samples_list[0][1]



In [ ]:

    
samples_array[0,1]

Most of the array creation functions listed above can be passed multidimensional shapes. For example:



In [ ]:

    
np.zeros((2,3))



In [ ]:

    
np.random.normal(10, 3, size=(2, 4))

In fact, an array can be reshaped at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is via the array's reshape method.



In [ ]:

    
arr = np.arange(8).reshape(2,4)
arr

With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions (using the same array as above):



In [ ]:

    
arr[1, 2:4]



In [ ]:

    
arr[:, 2]

If you only provide one index, then you will get the corresponding row.



In [ ]:

    
arr[1]

Now that we have seen how to create arrays with more than one dimension, it's a good idea to look at some of the most useful properties and methods that arrays have. The following provide basic information about the size, shape and data in the array:



In [ ]:

    
print('Data type                :', samples_array.dtype)
print('Total number of elements :', samples_array.size)
print('Number of dimensions     :', samples_array.ndim)
print('Shape (dimensionality)   :', samples_array.shape)
print('Memory used (in bytes)   :', samples_array.nbytes)

Arrays also have many useful methods, some especially useful ones are:



In [ ]:

    
print('Minimum and maximum             :', samples_array.min(), samples_array.max())
print('Sum, mean and standard deviation:', samples_array.sum(), samples_array.mean(), samples_array.std())

For these methods, the above operations area all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension, by passing the axis parameter; for example:



In [ ]:

    
samples_array.sum(axis=0)



In [ ]:

    
samples_array.sum(axis=1)

Notice that summing over the rows returned a 1d array above.
If you want to preserve the dimensions use the keepdims keyword



In [ ]:

    
samples_array.sum(axis=1, keepdims=True)

Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:



In [ ]:

    
samples_array.T

There is a wide variety of methods and properties of arrays.



In [ ]:

    
[attr for attr in dir(samples_array) if not attr.startswith('__')]

What is a Multi-Dimensional Array

memory is a linear address space
by adding information on shape and strides we can interpet bytes laid out linearly in memory as a multidimensional object



In [ ]:

    
Image('https://ipython-books.github.io/images/layout.png')

Exercises: Matrix Creation

Generate the following structure as a numpy array, without typing the values by hand. Then, create another array containing just the 2nd and 4th rows.

    [[1,  6, 11],
     [2,  7, 12],
     [3,  8, 13],
     [4,  9, 14],
     [5, 10, 15]]



In [ ]:

    
%load solutions/matrix_creation.py

Array Operations, Methods, and Functions



In [ ]:

    
sample1 = np.array([632, 1638, 569, 115])
sample2 = np.array([433,1130,754,555])

sample_sum = sample1 + sample2



In [ ]:

    
np.array([632, 1638, 569, 115])

This includes the multiplication operator -- it does not perform matrix multiplication, as is the case in Matlab, for example:



In [ ]:

    
print('{0} X {1} = {2}'.format(sample1, sample2, sample1 * sample2))

In Python 3.5, you can use the @ operator to get the inner product (or matrix multiplication) (!)



In [ ]:

    
print('{0} . {1} = {2}'.format(sample1, sample2, sample1 @ sample2))

this implies that the dimension of the arrays for each operation must match in size,
numpy will broadcast dimensions when possible
For example, suppose that you want to add the number 1.5 to each element arr1
We achieve this by broadcasting



In [ ]:

    
sample1 + 1.5

In this case, numpy looked at both operands and saw that the first was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow numpy to:

create new array of length 1
extend the array to match the size of the corresponding array

So in the above example, the scalar 1.5 is effectively cast to a 1-dimensional array of length 1, then stretched to length 4 to match the dimension of arr1. After this, element-wise addition can proceed as now both operands are one-dimensional arrays of length 4.

This broadcasting behavior is powerful, especially because when NumPy broadcasts to create new dimensions or to stretch existing ones, it doesn't actually replicate the data. In the example above the operation is carried as if the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This saves memory and improves the performance of operations.

When broadcasting, NumPy compares the sizes of each dimension in each operand. It starts with the trailing dimensions, working forward and creating dimensions as needed to accomodate the operation. Two dimensions are considered compatible for operation when:

they are equal in size
one is scalar (or size 1)

If these conditions are not met, an exception is thrown, indicating that the arrays have incompatible shapes.



In [ ]:

    
sample1 + np.array([7,8])



In [ ]:

    
b = np.array([10, 20, 30, 40])

bcast_sum = sample1 + b



In [ ]:

    
print('{0}\n\n+ {1}\n{2}\n{3}'.format(sample1, b, '-'*21, bcast_sum))



In [ ]:

    
c = np.array([-100, 100])
sample1 + c

Remember that matching begins at the trailing dimensions. Here, c would need to have a trailing dimension of 1 for the broadcasting to work. We can augment arrays with dimensions on the fly, by indexing it with a np.newaxis object, which adds an "empty" dimension:



In [ ]:

    
cplus = c[:, np.newaxis]
cplus



In [ ]:

    
cplus.shape



In [ ]:

    
sample1 + cplus



In [ ]:

    
sample1[:, np.newaxis] + c

Exercises: Array Manipulation

Divide each column of the array:

a = np.arange(25).reshape(5, 5)

elementwise with the array

b = np.array([1., 5, 10, 15, 20])



In [ ]:

    
# [Solution here]



In [ ]:

    
%load solutions/broadcasting.py

What Else

NumPy provides much more functionality than what we covered here
For example, facilities for linear algebra, FFTs, polynomials, and unit testing for floating point

Introduction to Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

Tabular data with heterogeneously-typed columns, as you might find in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data with row and column labels

Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.

Key features

Easy handling of missing data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes
Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Import Convention



In [ ]:

    
import pandas as pd

Pandas Series

A pandas Series is a generationalization of 1d numpy array
A series has an index that labels each element in the vector.
A Series can be thought of as an ordered key-value store.



In [ ]:

    
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.



In [ ]:

    
counts.values

Index Object

Pandas provides a labeled index to access the rows



In [ ]:

    
counts.index

We can assign meaningful labels to the index, if they are available:



In [ ]:

    
bacteria = pd.Series([632, 1638, 569, 115], 
                     index=['Firmicutes', 'Proteobacteria', 
                            'Actinobacteria', 'Bacteroidetes'])

bacteria

NumPy's math functions and other operations can be applied to Series without losing the data structure.



In [ ]:

    
np.log(bacteria)

Creation from a dict
Returned in key-sorted order



In [ ]:

    
bacteria_dict = {
    'Firmicutes': 632, 
    'Proteobacteria': 1638,
    'Actinobacteria': 569, 
    'Bacteroidetes': 115
}

pd.Series(bacteria_dict)

Pandas DataFrames

Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.

A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet.



In [ ]:

    
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum': ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
                                'Bacteroidetes', 'Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})
data

We often will want to peak at the first few rows of a DataFrame
You can use head to do this



In [ ]:

    
data.head()

Columns as an Index

The first axis of a DataFrame also has an index that represent the labeled columns



In [ ]:

    
data.columns

Reading and Writing Files

Pandas provides sophisticated I/O functionality
read_csv is a highly optimized csv reader



In [ ]:

    
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
vessels.head()

Exercises

Read a single file ../data/NationalFoodSurvey/NFS_1974.csv



In [ ]:

    
%load solutions/read_nfs_1974.py